This report investigates loan data from Prosper, a marketplace lending platform.

Data Set and Variables Section

## [1] 113937     81

The data set consists of 113,937 observations of 81 variables.

## 
##  factor integer numeric 
##      20      30      31
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
##                    ListingKey                        ListingCreationDate
##  17A93590655669644DB4C06:     6   2013-10-02 17:20:16.550000000:     6  
##  349D3587495831350F0F648:     4   2013-08-28 20:31:41.107000000:     4  
##  47C1359638497431975670B:     4   2013-09-08 09:27:44.853000000:     4  
##  8474358854651984137201C:     4   2013-12-06 05:43:13.830000000:     4  
##  DE8535960513435199406CE:     4   2013-12-06 11:44:58.283000000:     4  
##  04C13599434217079754AEE:     3   2013-08-21 07:25:22.360000000:     3  
##  (Other)                :113912   (Other)                      :113912  
##   CreditGrade                    LoanStatus                  ClosedDate   
##         :84984   Current              :56576                      :58848  
##  C      : 5649   Completed            :38074   2014-03-04 00:00:00:  105  
##  D      : 5153   Chargedoff           :11992   2014-02-19 00:00:00:  100  
##  B      : 4389   Defaulted            : 5018   2014-02-11 00:00:00:   92  
##  AA     : 3509   Past Due (1-15 days) :  806   2012-10-30 00:00:00:   81  
##  HR     : 3508   Past Due (31-60 days):  363   2013-02-26 00:00:00:   78  
##  (Other): 6745   (Other)              : 1108   (Other)            :54633  
##  ProsperRating..Alpha. BorrowerState                      Occupation   
##         :29084         CA     :14717   Other                   :28617  
##  C      :18345         TX     : 6842   Professional            :13628  
##  B      :15581         NY     : 6729   Computer Programmer     : 4478  
##  A      :14551         FL     : 6720   Executive               : 4311  
##  D      :14274         IL     : 5921   Teacher                 : 3759  
##  E      : 9795                : 5515   Administrative Assistant: 3688  
##  (Other):12307         (Other):67493   (Other)                 :55456  
##       EmploymentStatus IsBorrowerHomeowner CurrentlyInGroup
##  Employed     :67322   False:56459         False:101218    
##  Full-time    :26355   True :57478         True : 12719    
##  Self-employed: 6134                                       
##  Not available: 5347                                       
##  Other        : 3806                                       
##               : 2255                                       
##  (Other)      : 2718                                       
##                     GroupKey                 DateCreditPulled 
##                         :100596   2013-12-23 09:38:12:     6  
##  783C3371218786870A73D20:  1140   2013-11-21 09:09:41:     4  
##  3D4D3366260257624AB272D:   916   2013-12-06 05:43:16:     4  
##  6A3B336601725506917317E:   698   2014-01-14 20:17:49:     4  
##  FEF83377364176536637E50:   611   2014-02-09 12:14:41:     4  
##  C9643379247860156A00EC0:   342   2013-09-27 22:04:54:     3  
##  (Other)                :  9634   (Other)            :113912  
##         FirstRecordedCreditLine         IncomeRange    IncomeVerifiable
##                     :   697     $25,000-49,999:32192   False:  8669    
##  1993-12-01 00:00:00:   185     $50,000-74,999:31050   True :105268    
##  1994-11-01 00:00:00:   178     $100,000+     :17337                   
##  1995-11-01 00:00:00:   168     $75,000-99,999:16916                   
##  1990-04-01 00:00:00:   161     Not displayed : 7741                   
##  1995-03-01 00:00:00:   159     $1-24,999     : 7274                   
##  (Other)            :112389     (Other)       : 1427                   
##                     LoanKey                LoanOriginationDate
##  CB1B37030986463208432A1:     6   2014-01-22 00:00:00:   491  
##  2DEE3698211017519D7333F:     4   2013-11-13 00:00:00:   490  
##  9F4B37043517554537C364C:     4   2014-02-19 00:00:00:   439  
##  D895370150591392337ED6D:     4   2013-10-16 00:00:00:   434  
##  E6FB37073953690388BC56D:     4   2014-01-28 00:00:00:   339  
##  0D8F37036734373301ED419:     3   2013-09-24 00:00:00:   316  
##  (Other)                :113912   (Other)            :111428  
##  LoanOriginationQuarter                   MemberKey     
##  Q4 2013:14450          63CA34120866140639431C9:     9  
##  Q1 2014:12172          16083364744933457E57FB9:     8  
##  Q3 2013: 9180          3A2F3380477699707C81385:     8  
##  Q2 2013: 7099          4D9C3403302047712AD0CDD:     8  
##  Q3 2012: 5632          739C338135235294782AE75:     8  
##  Q2 2012: 5061          7E1733653050264822FAA3D:     8  
##  (Other):60343          (Other)                :113888

In summary, the Prosper loan data contains 113,937 observations of 81 variables: 20 factors (categorical variables), 30 integer variables, and 31 numeric variables.
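
The dimension and type counts shown above can be reproduced with a couple of base R calls. The sketch below uses a tiny made-up data frame (`loans` and its values are placeholders, not the real Prosper file), so only the technique carries over:

```r
# Toy stand-in for the Prosper data; the values are invented for illustration.
loans <- data.frame(
  CreditGrade = factor(c("C", "", "HR", "")),
  Term        = c(36L, 36L, 60L, 12L),
  BorrowerAPR = c(0.165, 0.120, 0.283, 0.125)
)

# Number of rows and columns, analogous to the "113937 81" output.
dims <- dim(loans)

# Number of columns per class (factor / integer / numeric).
type_counts <- table(sapply(loans, class))

print(dims)
print(type_counts)
```

On the full data set, `str()` and `summary()` then give the per-variable detail listed above.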


Univariate Plots Section

##  Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...

Ordering by largest to smallest counts:

## [1] ""   "A"  "AA" "B"  "C"  "D"  "E"  "HR" "NC"
## 
##           A    AA     B     C     D     E    HR    NC 
## 84984  3315  3509  4389  5649  5153  3289  3508   141

There appears to be a large number of observations (84,984) in an unnamed category. Removing this category:

There appear to be very few observations with an “NC” credit grade. As “AA” and “HR” (High Risk) ratings have similar counts and sit at opposite ends of the spectrum, comparing them in the bivariate section could provide some interesting results.
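
One way to remove the unnamed level and order the remaining grades by count is sketched here. The vector `grade` is toy data, and this is only one plausible approach, not necessarily the one used to produce the plots in this report:

```r
# Toy credit grades; "" mimics the unnamed level in the real data.
grade <- factor(c("", "", "", "C", "C", "D", "AA", "HR", "C", "D"))

# Keep only named grades, then drop the now-unused "" level.
named <- droplevels(grade[grade != ""])

# Count each grade and sort from largest to smallest.
counts <- sort(table(named), decreasing = TRUE)

# Re-level the factor in that order so a bar plot follows the counts.
ordered_grade <- factor(named, levels = names(counts))

print(counts)
```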

##  int [1:113937] 36 36 36 36 36 60 36 36 36 36 ...

Term appears to be split into three discrete levels.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   40.83   36.00   60.00
## 
##    12    36    60 
##  1614 87778 24545

Converting the “Term” variable into a categorical variable:

## 
##         12         36         60 
## 0.01416572 0.77040821 0.21542607

A term of 36 months appears to be by far the most common, while 12-month loans appear to be quite rare. It might be interesting to investigate the differences between the terms; I would suspect income range, amount of debt, and credit grade may be significant here (but I may be wrong!).
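
The conversion of “Term” to a categorical variable and the proportion table can be sketched as follows, using a short toy vector in place of the real column:

```r
# Toy loan terms in months (placeholder values).
term <- c(36L, 36L, 60L, 36L, 12L, 36L, 60L, 36L)

# Treat the three discrete terms as factor levels rather than numbers.
term_f <- factor(term, levels = c(12, 36, 60))

# Proportion of loans in each term.
term_prop <- prop.table(table(term_f))

print(round(term_prop, 3))
```

Fixing the levels explicitly keeps the 12/36/60 ordering stable even if a subset of the data happens to contain no loans of one term.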

##  num [1:113937] 0.165 0.12 0.283 0.125 0.246 ...

This has a wide distribution with a noticeable peak just above 0.35 (i.e. an APR of about 35%, since the variable is stored as a fraction).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

Decreasing the binwidth and zooming in on the bulk of the data slightly:

It is possible to locate the prominent peak just above 0.35 by calculating the mode:

## [1] 0.3579675
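
Base R has no built-in function for the statistical mode, so a small helper is needed. The function `stat_mode` below is a hypothetical helper written for illustration (the report does not show how its mode was computed), applied to toy APR values:

```r
# Hypothetical helper: most frequent value of a numeric vector, ignoring NAs.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  # table() stores the distinct values as names, so convert back to numeric.
  as.numeric(names(counts)[which.max(counts)])
}

# Toy APR values stored as fractions (0.358 means 35.8%).
apr <- c(0.165, 0.358, 0.120, 0.358, 0.283, NA, 0.358, 0.120)

apr_mode <- stat_mode(apr)
print(apr_mode)
```

If several values tie for the highest count, `which.max` returns the first of them, so this helper reports only one mode.
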
##  int [1:113937] 0 2 0 16 2 1 1 2 7 7 ...

Reordering by count, largest to smallest, and adding a cumulative bar plot:

## 
##            0            1            2            3            4 
## 0.1488980753 0.5117564970 0.0652378069 0.0630962725 0.0210203885 
##            5            6            7            8            9 
## 0.0066352458 0.0225738785 0.0921035309 0.0017465792 0.0007460263 
##           10           11           12           13           14 
## 0.0007986870 0.0019045613 0.0005178300 0.0175184532 0.0076884594 
##           15           16           17           18           19 
## 0.0133582594 0.0026681412 0.0004563926 0.0077674504 0.0067405672 
##           20 
## 0.0067668975

Listing category 1 (debt consolidation) is given as the reason for over half of the loans. It can also be seen that the four largest categories (~20% of the 21 categories) account for over 80% of all loans. It may be interesting to see if some loan types come with certain conditions (e.g. term length).
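
The cumulative share behind this observation can be computed by sorting the proportions and taking a running sum. A minimal sketch on toy category codes (the real column has 21 codes):

```r
# Toy listing category codes (placeholder values).
category <- c(1, 1, 1, 1, 1, 0, 0, 7, 2, 3)

# Proportion per category, largest first.
prop <- sort(prop.table(table(category)), decreasing = TRUE)

# Running total: share of loans covered by the top k categories.
cum_prop <- cumsum(prop)

print(round(cum_prop, 2))
```

Reading off where `cum_prop` first exceeds 0.8 gives the number of categories needed to cover 80% of the loans.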

##  Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...

Ordering by count, highest to lowest:

## 
##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

There are 2255 entries assigned to an unnamed category.

## 
##                    Employed     Full-time Not available  Not employed 
##   0.019791639   0.590870393   0.231312041   0.046929443   0.007328611 
##         Other     Part-time       Retired Self-employed 
##   0.033404425   0.009549137   0.006977540   0.053836769

A large proportion of people are listed as “Employed” or “Full-time”. I would suspect that being listed as “Not employed” would be detrimental to receiving a loan, but it would be good to check this by comparing this group with those listed as “Employed”.

##  int [1:113937] 2 44 NA 113 44 82 172 103 269 269 ...

Applying a square-root transformation, due to the strong right skew, to generate a more normal distribution:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.167   5.583   8.006  11.417  62.917    7625

Here, the x-axis has been converted to years (instead of months); a typical employment length is likely to be between 5.6 and 8 years (the median and mean respectively).

##  Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...

Proportion of homeowners:

## 
##     False      True 
## 0.4955282 0.5044718

The proportion of people who do and don’t own homes is very evenly split - almost 50:50. It would be interesting to look for differences in loan types between these two groups.

##  int [1:113937] 5 14 NA 5 19 21 10 6 17 17 ...

This appears to be an approximately normal distribution with a moderate amount of right skew.

Decreasing the binwidth, applying a square-root transformation to decrease the amount of skew, and zooming in slightly:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   10.00   10.32   13.00   59.00    7604

The number of current credit lines can be seen to take integer values with a typical value of 10 as seen from the median and mean.

##  int [1:113937] 1 13 0 7 6 13 6 5 12 12 ...

Here there is an almost normal distribution with moderate right skew present. Again, applying a square-root transformation to generate a more normal distribution and zooming in to exclude outliers:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00    6.00    6.97    9.00   51.00

A typical number of revolving accounts is likely to be around 6 or 7 from looking at mean and median values.

I would assume that the current number of credit lines or revolving accounts would have some impact on what type of loan a person could receive. Comparing these variables with loan amount or APR could confirm this.

##  int [1:113937] 3 3 0 0 1 0 0 3 1 1 ...

This is a strongly right-skewed distribution due to some extreme outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   1.435   2.000 105.000     697

Zooming in to exclude extreme outliers:

The number of inquiries in the last 6 months takes integer values, and the counts appear to decrease roughly exponentially.

## 
##   (-0.1,2]     (2,10]   (10,105] 
## 0.82177676 0.16653126 0.01169198

The majority of people (over 80%) had made two or fewer inquiries in the last 6 months.
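
The binned proportions above can be obtained with `cut()`. The sketch below uses the same break points as the output, but on a toy vector of inquiry counts:

```r
# Toy inquiry counts (placeholder values), including one missing value.
inquiries <- c(0, 0, 1, 2, 3, 0, 12, 1, NA, 5)

# Bin into (-0.1,2], (2,10], (10,105]; the -0.1 lower bound keeps zeros in the first bin.
bins <- cut(inquiries, breaks = c(-0.1, 2, 10, 105))

# Proportion of non-missing observations in each bin.
bin_prop <- prop.table(table(bins))

print(round(bin_prop, 3))
```

Intervals from `cut()` are closed on the right by default, which is why a lower break slightly below zero is needed to include the zero counts.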

##  int [1:113937] 4 0 0 14 0 0 0 0 0 0 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.155   3.000  99.000     990
## 
##    (-1,0.9]    (0.9,75]     (75,99] 
## 0.676768750 0.320902724 0.002328526

This distribution is strongly right-skewed, as seen by the very long right tail, and has a large number of observations with zero delinquencies in the last 7 years. It might make more sense to split this into two groups: people with no delinquencies, and people who have had one or more delinquencies.

Filtering out zero values and zooming in on the data to exclude outliers:

Therefore, the distribution appears to be exponentially decreasing, with the peak still clustered towards zero.

Calculating summary statistics for cases where “DelinquenciesLast7Years” does not equal zero:

##  DelinquenciesLast7Years
##  Min.   : 1.00          
##  1st Qu.: 3.00          
##  Median : 8.00          
##  Mean   :12.85          
##  3rd Qu.:17.00          
##  Max.   :99.00

The argument for two separate groups is quite strong given the change in median and mean: both increase by around 8 delinquencies in the last seven years when people with zero delinquencies are treated as a separate group. From a bank’s point of view, having at least one delinquency may be a sign that you are likely to have another, or at least more likely to have one than someone who has had none.
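
The comparison between all borrowers and those with at least one delinquency can be sketched as below. The vector `delinq` is toy data, so the numbers differ from the summaries above:

```r
# Toy delinquency counts (placeholder values), including one missing value.
delinq <- c(0, 0, 0, 0, 0, 0, 1, 3, 8, 20, NA)

# Statistics over everyone with a recorded value.
all_stats <- c(median = median(delinq, na.rm = TRUE),
               mean   = mean(delinq, na.rm = TRUE))

# Statistics over people with one or more delinquencies; which() drops the NA.
nonzero <- delinq[which(delinq > 0)]
nonzero_stats <- c(median = median(nonzero), mean = mean(nonzero))

print(all_stats)
print(nonzero_stats)
```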

##  num [1:113937] 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     880    4100   11210   13180  646285    7544
## # A tibble: 1 x 1
##       n
##   <int>
## 1  4881

This is a strongly right-skewed distribution due to some extreme outliers, and has a significant number of observations (4881) of available bankcard credit equal to zero (making a log transformation unsuitable).

Applying a square-root transformation to make the distribution more normal since most values are clustered towards zero:

Median and mean lines are also added to the plot to highlight the effect of the extreme outliers.
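
The reason a log transformation is unsuitable when zeros are present, while a square root is fine, can be seen in a short example on toy credit values:

```r
# Toy available-credit values (placeholder numbers), including zeros.
credit <- c(0, 0, 880, 4100, 13180, 646285)

# log10(0) is -Inf, so zero observations cannot be placed on a log axis.
log_vals <- log10(credit)

# sqrt(0) is 0, so every observation stays finite.
sqrt_vals <- sqrt(credit)

n_nonfinite <- sum(!is.finite(log_vals))
print(n_nonfinite)
```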

##  num [1:113937] 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...

## 
##     (0,2.5]   (2.5,9.9]  (9.9,10.5] 
## 0.995871455 0.001547018 0.002581527

Here we can see an almost normal distribution with right skew, combined with some anomalous-looking outliers at a debt-to-income ratio of 10.

Filtering out extreme outliers:

Number of cases where “DebtToIncomeRatio” is equal to zero:

## # A tibble: 1 x 1
##       n
##   <int>
## 1    19

Applying a square-root transformation to counter the right skew, and investigating the effect of the outliers on the mean of the distribution:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554
##  DebtToIncomeRatio
##  Min.   :0.0000   
##  1st Qu.:0.1400   
##  Median :0.2200   
##  Mean   :0.2419   
##  3rd Qu.:0.3100   
##  Max.   :1.4900

Lines have been added to highlight the change in mean due to including the outliers. It would be useful to know how this variable is related to a person’s income and credit grade, and if it is a strong predictor of what loan type a person may receive.
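
The effect of the outliers on the mean can be illustrated as follows. The cutoff of 1.5 is an assumption chosen for this sketch (the exact filter used for the second summary above is not shown), and `dti` is toy data:

```r
# Toy debt-to-income ratios (placeholder values) with two extreme outliers.
dti <- c(0.17, 0.18, 0.06, 0.15, 0.26, 0.36, 10.01, NA, 0.25, 10.01)

# Hypothetical cutoff for "extreme" values; not taken from the report.
cutoff <- 1.5

# Drop missing values and anything at or above the cutoff.
trimmed <- dti[!is.na(dti) & dti < cutoff]

mean_all     <- mean(dti, na.rm = TRUE)
mean_trimmed <- mean(trimmed)

print(c(with_outliers = mean_all, without_outliers = mean_trimmed))
```

The median is far less sensitive to this kind of filtering, which is why it stays at 0.22 in both summaries above.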

##  Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...

Ordering from low to high income:

## 
##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##    0.005450380    0.063842299    0.152163037    0.282542107    0.272519024 
## $75,000-99,999  Not displayed   Not employed 
##    0.148468013    0.067941055    0.007074085

Most people (~55%) appear to have an income between $25,000 and $74,999. Income is likely a strong factor in deciding whether a person is eligible for a loan and what type. Therefore, comparing the loans that high earners (e.g. $100,000+) receive with those of lower earners (e.g. $1-24,999) could provide some insightful information.
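
Factor levels are alphabetical by default, which places “$100,000+” before “$25,000-49,999”. A sketch of ordering the levels from low to high income, on a toy vector:

```r
# Toy income ranges (placeholder observations).
income <- factor(c("$25,000-49,999", "$100,000+", "$0",
                   "$1-24,999", "Not displayed", "$50,000-74,999"))

# Desired order: dollar ranges from low to high, then the non-numeric levels.
income_levels <- c("$0", "$1-24,999", "$25,000-49,999", "$50,000-74,999",
                   "$75,000-99,999", "$100,000+", "Not displayed", "Not employed")

# Rebuild the factor with the explicit level order.
income_ord <- factor(income, levels = income_levels)

print(table(income_ord))
```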

##  Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  False   True 
##   8669 105268
## 
##      False       True 
## 0.07608591 0.92391409

Most people (92%) have a verifiable income.

##  int [1:113937] 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...

This distribution appears to be moderately right-skewed, with most observations clustered towards the lower loan amounts.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

A typical loan amount is likely between $6,500 (the median) and $8,337 (the mean). The minimum loan amount appears to be $1,000, and the maximum appears to be $35,000.

Changing the histogram binwidth to better inspect the structure of this distribution:

Large peaks at loan amounts of $4,000, $10,000, and $15,000 can be seen. Looking at the loan amount on a smaller scale, it looks like common loan amounts are multiples of $500.

As loan amount is probably the most important requirement for someone seeking a loan, I would very much like to know how these loan amounts are related to factors such as income, debt, credit grade, home ownership, etc., for potential customers.

##  Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...

This is time-series data; two peaks can be seen, one around 2008 and another around 2014.

Investigating “ListingCreationDate” on a weekly scale:

It is clear that, instead of a peak at 2008, there is actually a trough during 2008, showing a sharp drop in activity. There also appears to be a dip in loan creations at the start of 2013.

Zooming in on region with very low values:

Therefore, we can see an anomalous period of very little activity (new loans), lasting around 9 months from around October 2008.
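
Weekly counts of this kind can be produced by dropping the time part of each timestamp and binning the dates by week. A minimal sketch on four toy timestamps in the same format as “ListingCreationDate”:

```r
# Toy timestamps in the same format as the real column (placeholder values).
stamps <- c("2008-09-29 10:15:00.000000000",
            "2008-10-01 08:00:00.000000000",
            "2008-10-02 17:30:00.000000000",
            "2008-10-15 12:00:00.000000000")

# Keep only the YYYY-MM-DD part and convert to Date.
dates <- as.Date(substr(stamps, 1, 10))

# Bin by week; each level is labelled with the Monday that starts the week.
week_start <- cut(dates, breaks = "week")

# Weeks with no listings appear with a count of zero.
weekly <- table(week_start)
print(weekly)
```

Because empty weeks are kept as levels, a gap in activity shows up as a run of zeros rather than being silently skipped.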

It might be possible to reveal trends over time by comparing this data with other variables, e.g. has the average APR increased or decreased over time?

Based upon the saying “money makes money”, I thought it would be interesting to consider a new variable: LoanOriginalAmount/AvailableBankcardCredit. This variable can be used to see if a person’s available bankcard credit determines how big a loan they can receive, i.e. above 1 indicates the loan is greater than their available bankcard credit and below 1 indicates the opposite.

This distribution is strongly right-skewed with a large peak near zero. Note, “+1” has been added to “AvailableBankcardCredit” to avoid non-finite values for when this variable equals zero.

Changing the binwidth and scaling by log10 to counter the strong skew:

##   mean_loa_abc median_loa_abc max_loa_abc min_loa_abc
## 1     14.14842       1.566171        9000 0.001547303
## 
##     (0,1] (1,9e+03] 
##  0.373977  0.626023

The result is that most people (~63%) obtain a loan which is greater than their available bankcard credit. This is also shown by a median ratio of about 1.57 (the mean is much higher due to the strong right skew in the distribution). Still, a non-negligible proportion of people (~37%) have available bankcard credit greater than the loan they obtained. It would be interesting to investigate the differences between these two groups. A smaller peak to the right of the main peak can also be seen in this distribution. This is due to the “+1” that was added to the “AvailableBankcardCredit” variable, and indicates a group of people whose available bankcard credit is equal to zero. It would also be intriguing to see what type of loan this group has access to.
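
The construction of this ratio, including the “+1” adjustment and the split at 1, can be sketched as follows. The vectors are toy values and the group labels are names chosen for this example:

```r
# Toy loan amounts and available bankcard credit (placeholder values).
loan_amount  <- c(9425, 10000, 3001, 15000, 4000)
avail_credit <- c(1500, 10266, 0, 30754, 695)

# Adding 1 to the denominator avoids dividing by zero when credit is 0.
ratio <- loan_amount / (avail_credit + 1)

# Split at 1: loan no larger than credit vs loan larger than credit.
group <- cut(ratio, breaks = c(0, 1, Inf),
             labels = c("loan <= credit", "loan > credit"))
group_prop <- prop.table(table(group))

# log10 scale used to counter the strong right skew when plotting.
log_ratio <- log10(ratio)

print(group_prop)
```

Borrowers with zero available credit end up with a ratio equal to their loan amount, which is what produces the separate smaller peak on the log scale.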

Univariate Analysis

What is the structure of your dataset?

  • “CreditGrade”

CreditGrade categories going from good to poor are: “AA” “A” “B” “C” “D” “E” “HR” “NC”. A large number of entries (84,984) are assigned to an unnamed category (“”). Excluding this category, the top three most common categories are “C”, “D”, and “B” in descending order.

  • “Term”

There appear to be only three different term lengths available: 12, 36, and 60 months, with proportions of approximately 1.4%, 77%, and 21.5% respectively.

  • “BorrowerAPR”

The mean and median Borrower APR are 0.21883 and 0.20976 (about 21.9% and 21.0%) respectively. This variable has a very broad, almost normal distribution with a large spike at approximately 0.3580 (35.8%). This could possibly be due to a link to the inflation rate, combined with a large number of people taking out loans at the same time.

  • “ListingCategory..numeric.”

Here we can see the listing categories follow the Pareto principle quite closely, as 20% of the listing categories are responsible for approximately 80% of the observations, with “1” corresponding to “Debt Consolidation” being the largest, accounting for over half the observations.

The listing categories are as follows: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

  • “EmploymentStatus”

There again seems to be an unnamed category this time with 2255 observations. The majority of people have indicated either “Employed” or “Full-time” with proportions of 59% and 23% respectively.

  • “EmploymentStatusDuration”

After transformation, this variable has an almost normal distribution with a long right tail. This skew can be observed also by the difference between the median and mean which are approximately 5.6 and 8.0 years respectively.

  • “IsBorrowerHomeowner”

This is a binary categorical variable showing an almost 50-50 split for homeowners and non-homeowners.

  • “CurrentCreditLines”

This is a discrete (integer-valued) variable with an almost normal distribution and a moderate right skew. It has a mean and median of approximately 10 current credit lines.

  • “OpenRevolvingAccounts”

This variable is very similar to the “CurrentCreditLines” variable but slightly more skewed with a mean and median of approximately 7 and 6 accounts respectively.

  • “InquiriesLast6Months”

This variable is extremely right-skewed. Most results are clustered around zero, with over 80% of people reporting 2 inquiries or fewer.

  • “DelinquenciesLast7Years”

The variable “DelinquenciesLast7Years” is highly right-skewed due to outliers with quite extreme values. The majority of people (~67%) do not have delinquencies in the last 7 years. Out of those that have at least one delinquency, the median and mean number of delinquencies are 8 and 13 respectively (rounding upward for the mean), compared to 0 and 4 delinquencies when the zero values are included.

  • “AvailableBankcardCredit”

This is a strongly right-skewed distribution with extreme outliers. It also contains a large spike at zero. Due to the high amount of skew, there is a large difference between the median and mean, with values of $4,100 and $11,210 respectively.

  • “DebtToIncomeRatio”

This variable appears to be approximately normal with moderate right-skew alongside some extreme outliers, particularly at a DebtToIncomeRatio of 10. These high values may possibly be due to students leaving university, just starting their careers, with very large amounts of debt. Excluding these outliers has the effect of changing the mean from 0.276 to 0.242, i.e. reducing the amount of skew.

  • “IncomeRange”

This is a categorical variable with levels (going from low to high for dollar amounts): $0; $1-24,999; $25,000-49,999; $50,000-74,999; $75,000-99,999; $100,000+ ; Not displayed; Not employed

  • “IncomeVerifiable”

This is a categorical variable with two levels. At 92%, the vast majority of people have a verifiable income.

  • “LoanOriginalAmount”

The overall distribution is right-skewed, and spikes can be seen at certain loan amounts. Particularly prominent are loans of $4,000, $10,000, and $15,000.

  • “ListingCreationDate”

This variable has a date-time format. Dropping the time (H:M:S) part of the observations and plotting a histogram allows us to see the activity of the lender on a more suitable scale. There is a clear halt in activity from around October 2008 to July 2009, which is most likely due to the economic recession that occurred around that time. An increase in loan activity can also be seen, peaking around the start of 2014.

What is/are the main feature(s) of interest in your dataset?

I would like to explore the features that a potential customer looking for a loan would care about, i.e. What loan amount could I get? What interest could I get on such a loan? How long will I have to pay the loan back? Therefore my main features are LoanOriginalAmount, BorrowerAPR, and Term.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Features which are likely to influence the above variables include: IncomeRange, DebtToIncomeRatio, CreditGrade, AvailableBankcardCredit, CurrentCreditLines, and IsBorrowerHomeowner. These are also features that the customers themselves would know, and would therefore be useful for answering questions like: how does my income range or credit grade affect the loan amount or APR I can get?

Did you create any new variables from existing variables in the dataset?

After plotting the “Term” variable, it was apparent that only three different terms are available. Therefore, I created a new variable which is a categorical version of this variable as it only has discrete values and this will be more useful for comparing against other variables.

A new variable, LoanOriginalAmount/(AvailableBankcardCredit+1), was also created. This was to investigate what proportion of people have available bankcard credit that is greater or smaller compared to the loan they receive.
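The two derived variables described above can be sketched as follows. The data frame holds toy rows rather than the real data, and the column names TermFactor and LoanToCredit are hypothetical names chosen for illustration.

```r
# Toy rows standing in for loan_data (values are illustrative only)
loans <- data.frame(
  Term = c(12, 36, 60, 36),
  LoanOriginalAmount = c(4000, 10000, 15000, 2500),
  AvailableBankcardCredit = c(0, 5000, 30000, 2500)
)

# Categorical version of Term: one level per distinct term length
loans$TermFactor <- factor(loans$Term)

# Loan amount relative to available credit; the +1 avoids division by zero
loans$LoanToCredit <- loans$LoanOriginalAmount / (loans$AvailableBankcardCredit + 1)

# Share of borrowers whose loan exceeds their available bankcard credit
mean(loans$LoanToCredit > 1)
```

A ratio above 1 means the loan is larger than the borrower's available bankcard credit, so the final line gives the proportion of such borrowers in the toy sample.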

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed or square-root-transformed multiple variables which contained a moderate to strong amount of skew in order to produce a more normal distribution. I also noted more than one categorical variable with an unnamed level, e.g. the unnamed category for “CreditGrade”, which was removed to visualise the data more clearly. These levels will most likely also be removed during the bivariate analysis, since it is unlikely that any significance can be attached to them.
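As a rough sketch of these tidying steps, the snippet below applies both transforms to toy right-skewed values and drops an unnamed factor level. The vectors are made up for illustration and are not taken from the data set.

```r
# Toy right-skewed values and a factor with an unnamed ("") level
amount <- c(1000, 2000, 4000, 10000, 35000)
grade  <- factor(c("", "A", "AA", "", "B"))

# Log and square-root transforms both compress the long right tail
log_amount  <- log10(amount)
sqrt_amount <- sqrt(amount)

# Drop observations in the unnamed level, then discard the now-empty level
named_grade <- droplevels(grade[grade != ""])
levels(named_grade)
```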

The “BorrowerAPR” distribution contains an odd peak at around 0.35 (i.e. 35%) which, as of yet, has no clear origin. The “DebtToIncomeRatio” variable also has a cluster of extreme outliers at a ratio of exactly 10, which is quite strange and also quite hard to explain. It is possible this was the maximum value that could be entered for this ratio, hence the cluster of observations at this value.


Bivariate Plots Section

Creating a pairwise plot of (what I have deemed) the most important variables:

From this plot, it is possible to see that many of the variables are poorly correlated (e.g. LoanOriginalAmount vs DebtToIncomeRatio, with a correlation value of 0.01). This section will look at the variable pairings with the higher correlation values.
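The correlation values shown in a pairwise plot can also be computed directly. The sketch below uses toy columns in place of the real data; the choice of pairwise deletion for missing values is an assumption, not necessarily what the pairwise plot used.

```r
# Toy numeric columns standing in for the selected loan_data variables
loans <- data.frame(
  LoanOriginalAmount      = c(4000, 10000, 15000, 2500, 8000, 20000),
  BorrowerAPR             = c(0.30, 0.18, 0.12, 0.35, NA, 0.10),
  AvailableBankcardCredit = c(500, 6000, 25000, 100, 3000, 40000)
)

# Pearson correlations; pairwise deletion keeps rows that are only partly missing
cor_matrix <- cor(loans, use = "pairwise.complete.obs")
round(cor_matrix, 2)
```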

Investigating first factors linked with “LoanOriginalAmount”; looking at loan amount and available bankcard credit and also zooming in to exclude outliers:

There appears to be a high volume of loans under $10,000 for people with a lower AvailableBankcardCredit. There also appears to be a slight negative correlation in this region, which contradicts the positive value in the pairwise plot. This is likely due to the large number of loans at multiples of $5,000, which seem independent of available bankcard credit.

From the correlation matrix, the correlation value is 0.2, and a slight positive correlation can be seen in the plot. It would be interesting to check how this trend breaks down by income and credit grade.

A negative correlation can be observed here; APR seems to increase as the available credit decreases.

Here a slight trend can be observed: the higher the loan amount, the larger the proportion of people from higher income brackets.

A similar trend to the last plot can be seen here: lower loan amounts seem much more common for people with poorer credit grades and higher loan amounts have a larger proportion of people with better credit grades.

It appears that the lower the income bracket, the more likely you are to be given a higher APR.

From this plot, there is a clear relationship between APR and credit grade; the better the credit grade, the better the APR.

The median loan amount can be seen to increase as income increases, with only a slight increase in variation.

In general, the median loan amount appears to increase as the credit grade improves. The one exception is the “AA” rating, which breaks the trend. This credit grade does have the highest variance, however, and it is possible that with more data we would see the trend continue.

It appears that the larger the loan, the more likely you are to have a longer term length.

It appears, from this plot, that the APR will increase the poorer a credit rating someone has.

In general, APR appears to decrease for people with high incomes. There seems to be one exception; for people with $0 incomes, the APR is strangely low. It is possible that $0 incomes refer to students who may be allowed special, student only, deals which give them low APRs.

The APR does not seem to be greatly affected by the term of the loan. It seems, however, that the APR of 60 month loans fluctuates less than that of other terms, as seen from the smaller variance in this plot.

It appears that a longer term is associated with a higher debt to income ratio. This would make sense, as a lower ratio would imply a higher disposable income, allowing a loan to be paid back more quickly and therefore requiring a shorter term.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Against my intuition, the loan amount was not very strongly correlated with the debt to income ratio. A slight negative correlation between loan amount and available bankcard credit was observed for loans under $10,000, in contrast to the positive value in the pairwise plot. However, a few strong relationships were apparent from these plots: as credit grade improved, APR appeared to decrease while the median loan amount increased. Additionally, as the income bracket increased, the median loan amount appeared to increase while the APR seemed to decrease. It was also apparent that the term of the loan did not have much effect on the APR, but a longer term does seem to be more common with a larger loan.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It appears that people with a lower debt to income ratio require less time (i.e. shorter terms) to pay back their loans.

What was the strongest relationship you found?

In terms of correlation between two numerical variables, the strongest relationship was between BorrowerAPR and AvailableBankcardCredit, with a correlation value of -0.349. However, categorical variables such as IncomeRange and CreditGrade were more insightful, as clear changes in the median values of LoanOriginalAmount and BorrowerAPR could be seen for different levels within these variables.


Multivariate Plots Section

Investigating how LoanOriginalAmount, BorrowerAPR, and Term are related to the other variables of interest:

Lower credit grades appear to have a smaller range and variance of available bankcard credit as well as loan amounts.

Similar to the previous credit grade plot, it appears that the largest loan amounts (i.e. $30,000 or above) are mostly limited to people with the highest income grade ($100,000+). These people also seem more likely to have a large available bankcard credit compared to lower income ranges.

Although it is quite a weak correlation, it appears that for people with income ranges of $1-24,999 an increase in debt-to-income ratio limits the maximum loan amount possible. This excludes people with a ratio of 10, where there seems to be an odd increase in the number of people in this income range who also manage to obtain relatively large loans.

There does not seem to be any new information to draw from this plot considering the previous two plots.

library(ggplot2)
# Density of loan amount relative to available bankcard credit, by income range
ggplot(data = loan_data,
       aes(x = LoanOriginalAmount / (AvailableBankcardCredit + 1),
           color = factor(IncomeRange, levels(IncomeRange)[c(7, 8, 1, 2, 4:6, 3)]))) +
  geom_density(size = 1) +
  scale_x_log10() +
  scale_color_brewer(palette = "RdYlBu", name = "Income Range")
## Warning: Removed 7544 rows containing non-finite values (stat_density).

As it is hard to discern any significant change in the mean across the income range categories for this variable, it is possible that this variable has only a weak relationship with income range.

There appears to be a significant difference in loan amount for homeowners and non-homeowners. It would be interesting to see how “IsBorrowerHomeowner” affects other variables such as “BorrowerAPR”.

Wrapping the plots by “Term” gives some surprising results. There appears to be only one term length (36 months) for which credit grades are (properly) recorded. It is possible to look at this further by investigating how credit grade has changed over time:

It can be seen that at some point in 2008, the levels of credit grade ceased to be recorded. There also appears to be a relationship between credit grade and APR. Investigating BorrowerAPR further:

Lower loan APRs also seem more common with people who have higher available bankcard credit and income bracket.

There does not appear to be much of a correlation between APR and current credit lines. However, the distribution of current credit lines is interesting in that lower income brackets (i.e. below $25,000) appear to very rarely have over 10 current credit lines, whereas 10 or more credit lines seem much more common for higher income brackets.

Higher loan amounts appear to have less varied and lower APRs.

Credit grade appears to be significantly connected to BorrowerAPR. Converting this plot to show mean APR values by credit grade:

A better credit grade appears to result in a lower BorrowerAPR.

The relationship from the previous plot can again be seen when investigating available bankcard credit. It is also noticeable that those with the highest available bankcard credit also appear to have the best credit rating.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From these plots, it is clear that the highest loans and lowest APRs go to people with the best credit grades and highest incomes. It also appears that people with lower income brackets and poorer credit ratings not only have smaller loans and higher APRs in general (clearly shown by the plot of BorrowerAPR vs LoanOriginalAmount split by CreditGrade), but also have fewer credit lines, less available credit, and higher debt-to-income ratios. Owning a home also appears to be connected, in general, with having a higher income (and hence lower APRs) as well as receiving higher loan amounts compared to those who do not own a home.

Were there any interesting or surprising interactions between features?

There were some surprising results after investigating the relationship between the loan term and credit grade. Initially it appeared that only the 36 month term had properly labelled credit grades. Looking into this further by checking how the credit grades changed over time, it is apparent that after some point in 2008, credit grades were no longer recorded. This is most likely linked to the drop in activity seen in the univariate exploration section, and is possibly due to a change in regulation after the financial crash in 2008.


Final Plots and Summary

Plot One

Median values for loan amounts by income bracket:

##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##           5000           4000          12000           5000           7500 
## $75,000-99,999  Not displayed   Not employed 
##           9700           3033           4000
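A table of this shape can be produced with tapply(), which applies a function within each level of a factor. The sketch below uses toy rows rather than the real data, and it is an assumption that the figures above were computed this way.

```r
# Toy rows standing in for loan_data (values are illustrative only)
loans <- data.frame(
  IncomeRange = factor(c("$0", "$1-24,999", "$1-24,999",
                         "$100,000+", "$100,000+", "$100,000+")),
  LoanOriginalAmount = c(5000, 3000, 5000, 10000, 12000, 25000)
)

# Median loan amount within each income bracket, one value per factor level
median_by_income <- tapply(loans$LoanOriginalAmount, loans$IncomeRange, median)
median_by_income
```

Because factor levels are sorted as text by default, the brackets appear in alphabetical rather than dollar order, which is why "$100,000+" precedes "$25,000-49,999" in the output above.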

Description One

In this plot, the distribution of loan amount is split by income bracket; the median loan amount appears to increase as income bracket increases. The highest median loan amount is for the $100,000+ bracket, with a median of $12,000. This bracket also has the highest variance, because it contains more outliers with high loan amounts. The lowest median amount is for those in the “Not displayed” bracket at $3,033. The $0 bracket has a higher than expected loan amount when compared to neighboring levels, which may be due to special circumstances for people classed in this bracket. It also has quite a large variance compared to its neighbors. However, both of these factors may be due to the small proportion of people in this bracket (as seen in the univariate exploration section).

Plot Two

Description Two

This plot shows how “BorrowerAPR” evolves over time by plotting it against “ListingCreationDate”, and it includes the running mean of APR for each category of credit grade. Several points can be gleaned from this plot. Firstly, the borrower APR is split by credit grade, with better grades being given a lower APR and vice versa. Secondly, there is a period starting in 2008 for which almost no credit grade information appears to have been recorded. From around halfway through 2009 onwards, all credit grade information was assigned to an unnamed category. This change in procedure is likely linked to the financial crash in 2008.
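One way to compute a running mean is a centred moving average, sketched below for a single credit grade. The APR values are toy numbers and the window of 3 observations is an assumption; the smoothing method used for the plot is not stated here.

```r
# Toy APR series for one credit grade, ordered by listing date
apr <- c(0.10, 0.12, 0.11, 0.15, 0.14, 0.13)

# Centred moving average over a window of 3 observations (window size is an assumption)
window <- 3
running_mean <- stats::filter(apr, rep(1 / window, window), sides = 2)

# The first and last positions are NA because the window is incomplete there
rm_values <- as.numeric(running_mean)
rm_values
```

To obtain one line per credit grade, the same calculation would be repeated on the subset of rows belonging to each grade.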

Plot Three

Description Three

This plot shows the APR against available bankcard credit by income range, split into a homeowners plot and a non-homeowners plot. I thought splitting the plot by homeowner status might be insightful as the sample is made up of almost exactly half homeowners and half non-homeowners. Each plot shows the same trend: the higher the income bracket, the lower the APR, as seen by the grouping of orange points (bracket $25,000-49,999) to the top and pink points ($100,000+) to the bottom left. In addition to this, the proportion of the highest income bracket ($100,000+) appears to be much higher for homeowners, seen by the grouping of pink points in the lower left of the “True” plot for “IsBorrowerHomeowner”, compared to non-homeowners. There also seems to be a higher proportion of the income bracket $25,000-49,999 for non-homeowners, as seen by the prevalence of orange points in this plot.


Reflection

This data set consists of 113,937 observations of 81 variables. Due to the high number of features, one of the challenges of this project was choosing which variables, and combinations of variables, to investigate out of all the possible choices, for fear of missing something important. From tackling this project, it is clear that any future work involving a high number of variables will require smart feature selection and, ideally, some amount of domain knowledge. Due to my lack of domain knowledge of the marketplace lending industry, I decided to investigate a large number of variables in the univariate exploration phase of this project. In the analysis phase of that section, my method was then to pick variables which a customer might want to know about when applying for a loan, e.g. how does owning a home affect my potential loan interest rate? Following on from this approach, I would reason that loan amount and loan APR are some of the most important variables in this regard. In the following sections I then explored the relationships between these variables and other variables of interest, exploring new questions that arose from the results of each new observation/lead.

While exploring the data, it became clear that loan amount and loan APR appear to be strongly affected by income range and credit grade. In summary, the higher the income range and/or the better the credit rating, the more likely you are to be awarded a larger loan and/or lower APR. Using the time-series data in the data set (specifically the “ListingCreationDate” variable), it was possible to see a drop in loan generation in 2008, most likely due to the economic recession that occurred around that time. This seems to have had a significant impact on the industry, as further exploration also revealed that credit grade data was no longer recorded after this point.

As the most recent data in this data set is from 2014, it is possible that some of the observed trends do not transfer to current trends in this industry. With regard to the credit grade data, since a large proportion of this data set does not have a category, it is also possible that the observed trends for this variable may not hold when a larger sample size is considered. It is worth noting that since no statistical tests have been carried out, none of the relationships hypothesised above can be declared statistically significant until such tests are performed. For further investigation, it would be very interesting to see similar information for people who applied but were declined a loan, and from this, establish a model which might predict whether someone is likely to be awarded their requested loan (amount and APR) or not.
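As a minimal sketch of the kind of test that could be carried out, the snippet below runs a rank correlation test on toy values standing in for BorrowerAPR and AvailableBankcardCredit. The numbers are made up, and the choice of Spearman's method is an assumption made because both variables are skewed.

```r
# Toy values standing in for BorrowerAPR and AvailableBankcardCredit
apr    <- c(0.35, 0.30, 0.28, 0.22, 0.18, 0.15, 0.12, 0.10)
credit <- c(100, 500, 2000, 1500, 6000, 9000, 25000, 40000)

# Spearman's rank correlation does not assume a linear relationship or normality
result <- cor.test(apr, credit, method = "spearman")

# A negative estimate with a small p-value would support a negative association
result$estimate
result$p.value
```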